AITopics | realistic evaluation

Realistic Evaluation of Transductive Few-Shot Learning - Supplementary Material

Neural Information Processing SystemsApr-25-2026, 19:42:16 GMT

In the main tables of the paper, we did not include the performances of α-TIM in the standard balanced setting. Here, we emphasize that α-TIM is a generalization of TIM [1] as when α 1 (i.e., the α-entropies tend to the Shannon entropies), α-TIM tends to TIM. Therefore, in the standard setting, where optimal hyper-parameter αis obtained over validation tasks that are balanced (as in the standard validation tasks of the original TIM and the other existing methods), the performance of α-TIM is the same as TIM. When αis tuned on balanced validation tasks, we obtain an optimal value of αvery close to 1, and our α-mutual information approaches the standard mutual information. When the validation tasks are uniformly random, as in our new setting and in the validation plots we provided in the main figure, one can see that the performance of α-TIM remains competitive when we tend to balanced testing tasks (i.e., when a is increasing), but is significantly better than TIM when we tend to uniformly-random testing tasks (a = 1).

artificial intelligence, imbalance, machine learning, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Realistic evaluation of transductive few-shot learning

Neural Information Processing SystemsDec-24-2025, 02:27:45 GMT

Transductive inference is widely used in few-shot learning, as it leverages the statistics of the unlabeled query set of a few-shot task, typically yielding substantially better performances than its inductive counterpart. The current few-shot benchmarks use perfectly class-balanced tasks at inference. We argue that such an artificial regularity is unrealistic, as it assumes that the marginal label probability of the testing samples is known and fixed to the uniform distribution. In fact, in realistic scenarios, the unlabeled query sets come with arbitrary and unknown label marginals. We introduce and study the effect of arbitrary class distributions within the query sets of few-shot tasks at inference, removing the class-balance artefact. Specifically, we model the marginal probabilities of the classes as Dirichlet-distributed random variables, which yields a principled and realistic sampling within the simplex.

name change, realistic evaluation, transductive few-shot learning, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.42)

Add feedback

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Neural Information Processing SystemsNov-20-2025, 22:57:03 GMT

Semi-supervised learning (SSL) provides a powerful framework for leveraging unlabeled data when labels are limited or expensive to obtain. SSL algorithms based on deep neural networks have recently proven successful on standard benchmark tasks. However, we argue that these benchmarks fail to address many issues that SSL algorithms would face in real-world applications. After creating a unified reimplementation of various widely-used SSL techniques, we test them in a suite of experiments designed to address these issues. We find that the performance of simple baselines which do not use unlabeled data is often underreported, SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and performance can degrade substantially when the unlabeled dataset contains out-of-distribution examples. To help guide SSL research towards real-world applicability, we make our unified reimplemention and evaluation platform publicly available.

artificial intelligence, machine learning, realistic evaluation, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)

Add feedback

Over-Relying on Reliance: Towards Realistic Evaluations of AI-Based Clinical Decision Support

Sivaraman, Venkatesh, Morrison, Katelyn, Epperson, Will, Perer, Adam

arXiv.org Artificial IntelligenceApr-11-2025

As AI-based clinical decision support (AI-CDS) is introduced in more and more aspects of healthcare services, HCI research plays an increasingly important role in designing for complementarity between AI and clinicians. However, current evaluations of AI-CDS often fail to capture when AI is and is not useful to clinicians. This position paper reflects on our work and influential AI-CDS literature to advocate for moving beyond evaluation metrics like Trust, Reliance, Acceptance, and Performance on the AI's task (what we term the "trap" of human-AI collaboration). Although these metrics can be meaningful in some simple scenarios, we argue that optimizing for them ignores important ways that AI falls short of clinical benefit, as well as ways that clinicians successfully use AI. As the fields of HCI and AI in healthcare develop new ways to design and evaluate CDS tools, we call on the community to prioritize ecologically valid, domain-appropriate study setups that measure the emergent forms of value that AI can bring to healthcare professionals.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2504.07423

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.15)

Genre: Research Report > Experimental Study (0.47)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Applied AI (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.67)

Add feedback

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Neural Information Processing SystemsJan-20-2025, 04:48:00 GMT

Semi-supervised learning (SSL) provides a powerful framework for leveraging unlabeled data when labels are limited or expensive to obtain. SSL algorithms based on deep neural networks have recently proven successful on standard benchmark tasks. However, we argue that these benchmarks fail to address many issues that SSL algorithms would face in real-world applications. After creating a unified reimplementation of various widely-used SSL techniques, we test them in a suite of experiments designed to address these issues. We find that the performance of simple baselines which do not use unlabeled data is often underreported, SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and performance can degrade substantially when the unlabeled dataset contains out-of-distribution examples.

deep semi-supervised learning algorithm, realistic evaluation, unlabeled data, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)

Add feedback

Realistic evaluation of transductive few-shot learning

Neural Information Processing SystemsOct-10-2024, 07:36:53 GMT

Transductive inference is widely used in few-shot learning, as it leverages the statistics of the unlabeled query set of a few-shot task, typically yielding substantially better performances than its inductive counterpart. The current few-shot benchmarks use perfectly class-balanced tasks at inference. We argue that such an artificial regularity is unrealistic, as it assumes that the marginal label probability of the testing samples is known and fixed to the uniform distribution. In fact, in realistic scenarios, the unlabeled query sets come with arbitrary and unknown label marginals. We introduce and study the effect of arbitrary class distributions within the query sets of few-shot tasks at inference, removing the class-balance artefact.

arbitrary class distribution, realistic evaluation, transductive few-shot learning, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.78)

Add feedback

Reviews: Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Neural Information Processing SystemsOct-8-2024, 02:44:25 GMT

This paper proposes a systematic evaluation of SSL methods, studies the pitfalls of current approaches to evaluation, and, conducts experiments to show the impact of rigorous validation on kinds of conclusions we can draw from these methods. I really like the paper and read it when it appeared on arXiv back in April. In many places we are lacking these kind of systematic approaches to robust evaluations and it's refreshing to see more of these papers emerging that question the foundation of our validation methodologies and provide a coherent evaluation. Suggestions for improvements: - The paper mainly deals with two image categorisation datasets. While these methods have been studied in many recent SSL papers, they also have their own limitations, some of which is mentioned in the paper. But the main problem is that it restricts them to a single domain which is image categorisation.

artificial intelligence, inductive learning, machine learning, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.40)

Add feedback

Constrained Adaptive Attacks: Realistic Evaluation of Adversarial Examples and Robust Training of Deep Neural Networks for Tabular Data

Simonetto, Thibault, Ghamizi, Salah, Desjardins, Antoine, Cordy, Maxime, Traon, Yves Le

arXiv.org Artificial IntelligenceNov-8-2023

State-of-the-art deep learning models for tabular data have recently achieved acceptable performance to be deployed in industrial settings. However, the robustness of these models remains scarcely explored. Contrary to computer vision, there is to date no realistic protocol to properly evaluate the adversarial robustness of deep tabular models due to intrinsic properties of tabular data such as categorical features, immutability, and feature relationship constraints. To fill this gap, we propose CAA, the first efficient evasion attack for constrained tabular deep learning models. CAA is an iterative parameter-free attack that combines gradient and search attacks to generate adversarial examples under constraints. We leverage CAA to build a benchmark of deep tabular models across three popular use cases: credit scoring, phishing and botnet attacks detection. Our benchmark supports ten threat models with increasing capabilities of the attacker, and reflects real-world attack scenarios for each use case. Overall, our results demonstrate how domain knowledge, adversarial training, and attack budgets impact the robustness assessment of deep tabular models and provide security practitioners with a set of recommendations to improve the robustness of deep tabular models against various evasion attack scenarios.

adversarial example and robust training, constrained adaptive attack, realistic evaluation, (2 more...)

arXiv.org Artificial Intelligence

2311.04503

Genre: Research Report > New Finding (0.53)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.73)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Structure-Grounded Pretraining for Text-to-SQL

Deng, Xiang, Awadallah, Ahmed Hassan, Meek, Christopher, Polozov, Oleksandr, Sun, Huan, Richardson, Matthew

arXiv.org Artificial IntelligenceOct-24-2020

Learning to capture text-table alignment is essential for table related tasks like text-to-SQL. The model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and train them using weak supervision without requiring complex SQL annotation. Additionally, to evaluate the model under a more realistic setting, we create a new evaluation set Spider-Realistic based on Spider with explicit mentions of column names removed, and adopt two existing single-database text-to-SQL datasets. StruG significantly outperforms BERT-LARGE on Spider and the realistic evaluation sets, while bringing consistent improvement on the large-scale WikiSQL benchmark.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2010.12773

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > Ohio (0.04)
North America > United States > New Jersey (0.04)
Europe > Belgium (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.48)

Add feedback

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Oliver, Avital, Odena, Augustus, Raffel, Colin A., Cubuk, Ekin Dogus, Goodfellow, Ian

Neural Information Processing SystemsFeb-14-2020, 12:12:13 GMT

Semi-supervised learning (SSL) provides a powerful framework for leveraging unlabeled data when labels are limited or expensive to obtain. SSL algorithms based on deep neural networks have recently proven successful on standard benchmark tasks. However, we argue that these benchmarks fail to address many issues that SSL algorithms would face in real-world applications. After creating a unified reimplementation of various widely-used SSL techniques, we test them in a suite of experiments designed to address these issues. We find that the performance of simple baselines which do not use unlabeled data is often underreported, SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and performance can degrade substantially when the unlabeled dataset contains out-of-distribution examples. To help guide SSL research towards real-world applicability, we make our unified reimplemention and evaluation platform publicly available.

deep semi-supervised learning algorithm, realistic evaluation, unlabeled data, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)

Add feedback